ABSTRACT



An apparatus for detecting the position of a human face in 
an input image or video image and a method thereof are 
provided. The apparatus includes an eye position detecting 
means for detecting pixels having a strong gray character- 
istic to determine areas having locality and texture charac- 
teristics as eye candidate areas among areas formed by the 
detected pixels, in an input red, blue, and green (RGB) 
image, a face position determining means for creating search 
templates by matching a model template to two areas 
extracted from the eye candidate areas, and determining an 
optimum search template among the created search tem- 
plates by using the value normalizing the sum of a prob- 
ability distance for the chromaticity of pixels within the area 
of a search template, and horizontal edge sizes calculated in 
the positions of the left and right eyes, a mouth and a nose 
estimated by the search template, and an extraction position 
stabilizing means for forming a minimum boundary rect- 
angle by the optimum search template, and increasing count 
values corresponding to the minimum boundary rectangle 
area and reducing count values corresponding to an area 
other than the minimum boundary rectangle area, among 
count values of individual pixels, stored in a shape memory, 
to output the area in which count values above a predeter- 
mined value are positioned, as eye and face areas. The 
apparatus is capable of accurately and quickly detecting a 
speaking person's eyes and face in an image, and is tolerant 
of image noise. 

32 Claims, 7 Drawing Sheets 
APPARATUS AND METHOD FOR DETECTING SPEAKING PERSON'S EYES
AND FACE

This application claims priority under 35 U.S.C. §§119
and/or 365 to 99-55577 filed in Korea on Dec. 7, 1999; the
entire content of which is hereby incorporated by reference.

BACKGROUND OF THE INVENTION

1. The Field of the Invention

This invention relates to image signal processing, and
more particularly to apparatus and method for interpreting
and extracting the features of human faces represented in
images input through camera sensor or video images, to
detect the human face position within the images.

2. Description of the Related Art

Recently, in the study of artificial intelligence field,
attention and study has been focussed on implanting the
recognition capability human beings have into a computer to
endow intelligence on the computer or machine. In
particular, face recognition technology using the human
vision system has been very actively and widely studied
throughout all fields related to computer vision and image
processing, such as image processing, pattern recognition,
and facial expression investigation. A technique for
detecting faces and facial area is highly regarded in various
applied fields such as facial expression research, drivers'
drowsiness detection, entrance/exit control, or image
indexing. Humans easily detect a facial area even in various
and dynamic environments, while it is not an easy thing for
computers to perform this, even in a relatively simple image
environment.

Representative approaches in previously proposed facial
area detection methods include a method of using a neural
network (U.S. Pat. No. 5,680,481), a method of using the
statistical features of facial brightness, such as a principal
component analysis of brightness (U.S. Pat. No. 5,710,833),
and a matching method proposed by T. Poggio (IEEE
Transactions on Pattern Analysis and Machine Intelligence
20, 1998). In order to employ the extracted face candidate
image as the input of a face recognition system, a means of
detecting the exact position of facial components or facial
features in the extracted face candidate region is required.
In other words, in order to compare an input image with a
model, position extraction and a size normalizing process
for compensating for differences in size, angle, and
orientation of the facial image extracted from the input image
relative to a facial image of the model template are
prerequisite for enhanced recognition and matching efficiency.
In most face recognition systems, an eye area or the central
area of a pupil is used as a reference facial component in the
alignment and the normalizing processes. This is because
features of the eye area remain unchanged compared
with those of other facial components, even if a change
occurs in the size, expression, attitude, lighting, etc., of a
facial image.

Many studies on detecting the eye area or the central
position of the pupil from an image are ongoing. Methods
applied to conventional face recognition systems mainly
adopt a pupil detection method. A representative pupil
detection method is to employ normalized correlation at all
locations within an input image by making eye templates of
various sizes and forming a Gaussian pyramid image of the
input image. Furthermore, U.S. Pat. No. 5,680,481 and IEEE
TPAMI 19, 1997, by Moghaddam and T. Poggio (IEEE
TPAMI 20, 1998) show a method in which eigen matrixes
for eyes, nose, and mouth areas are provided according to
the size of a template, and features of interest are searched
through comparison with an input image in all areas within
the template image. A problem that both methods have in
common is that all areas in an image have to be searched
with several model templates classified on the basis of size
or orientation for all areas of an image, since no information
on size, orientation or location of eye or nose features is
made available in the input image. This not only causes
excessive computation, but also requires determining a
threshold value for defining each area, and causes excessive
[...], so that application to an actual system is made
difficult.

U.S. Pat. No. 5,832,115 discloses that templates having
two concentric ellipses for detecting facial ellipses may be
used to detect a facial location through evaluating the size of
edge contours which encircle the face in a region between
the two ellipses. However, even in this case, the same
problem occurs in that the size and orientation of an
elliptical template has to be determined and searched through
all areas within the image.

In order to overcome such problems in facial location
detection, many recent studies have focussed on the use of
color images. Based on the fact that, in most color images,
a color value in the color of a face or skin approximates a
general statistical value, the study of extracting candidate
facial areas by detection of skin color forms a mainstream
(see COMPAQ TR CRL9811, 1998, and references
therein). Recently, the studies have been successfully
applied in color indexing, and facial tracking and extraction.
However, the facial position extraction by a color is greatly
affected by image obtaining conditions such as a camera
which acquires an image, illumination color, and surface and
[...] of an object. For example, two different cameras give
different color values even in the same environment and for
the same person. In particular, a face or skin color value
significantly changes depending on illumination. In a case in
which the image obtaining conditions are unknown, it is
difficult to determine the range of a skin color value for
identifying only face color region. Furthermore, a process of
determining only facial areas for similar skin colors which
are widely extracted, including background regions, is not
an easy task and requires many subsequent processes.

SUMMARY OF THE INVENTION

To solve the above problem, it is an objective of the
present invention to provide an apparatus which is capable
of accurately and quickly detecting a speaking person's eye
and face position, and which is tolerant of image noise.

It is another objective of the present invention to provide
a method of accurately and quickly detecting a speaking
person's eye and face.

Accordingly, to achieve the above objective, an apparatus
for a speaking person's eye and face detection according to
an embodiment of the present invention includes an eye
position detecting means for detecting pixels having a strong
gray characteristic to determine areas having locality and
texture characteristics as eye candidate areas among areas
formed by the detected pixels, in an input red, blue, and
green (RGB) image, a face position determining means for
creating search templates by matching a model template to
two areas extracted from the eye candidate areas, and
determining an optimum search template among the created
search templates by using the value normalizing the sum of
a probability distance for the chromaticity of pixels within
the area of a search template, and horizontal edge sizes
calculated in the positions of the left and right eyes, a mouth
and a nose estimated by the search template, and an
extraction position stabilizing means for forming a minimum
boundary rectangle by the optimum search template, and
increasing count values corresponding to the minimum
boundary rectangle area and reducing count values
corresponding to an area other than the minimum boundary
rectangle area, among count values of individual pixels,
stored in a shape memory, to output the area in which count
values above a predetermined value are positioned, as eye
and face areas.

To achieve another objective of the present invention, a 
method of detecting a speaking person's eye and face 
includes the steps of detecting pixels having a strong gray 
characteristic to determine areas having locality and texture 
characteristics as eye candidate areas among areas formed 
by the detected pixels, in an input red, blue, and green 
(RGB) image, creating search templates by matching a 
model template to two areas extracted from the eye candi- 
date areas, and determining an optimum search template 
among the created search templates by using the value 
normalizing the sum of a probability distance for the chro- 
maticity of pixels within the area of a search template, and 
horizontal edge sizes in the positions of the left and right 
eyes, a mouth and a nose, estimated by the search template, 
in the RGB image, and forming a minimum boundary 
rectangle by the optimum search template, and increasing 
count values corresponding to the minimum boundary rect- 
angle area and reducing count values corresponding to an 
area other than the minimum boundary rectangle area, 
among count values of individual pixels, stored in a shape 
memory, to output the area, in which count values above a 
predetermined value are positioned, as eye and face areas. 
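
The count-value step at the end of the method above can be sketched roughly as follows. This is only an illustrative sketch, not the claimed implementation: the function names, the `(top, left, bottom, right)` rectangle layout, and the step size of 1 are assumptions made for the example, while the count range of 0-7 and the threshold of 3 are the values quoted for the preferred embodiment later in the description.

```python
def update_shape_memory(memory, mbr, step=1, max_count=7):
    """Raise counts inside the MBR and lower counts outside it.

    memory: list of rows of per-pixel count values (modified in place).
    mbr: hypothetical (top, left, bottom, right) tuple, inclusive bounds.
    Counts saturate at max_count and never drop below zero.
    """
    top, left, bottom, right = mbr
    for r, row in enumerate(memory):
        for c, count in enumerate(row):
            inside = top <= r <= bottom and left <= c <= right
            if inside:
                row[c] = min(max_count, count + step)
            else:
                row[c] = max(0, count - step)
    return memory


def stable_area(memory, threshold=3):
    """Mark pixels whose count value is above the threshold."""
    return [[count > threshold for count in row] for row in memory]


# Example: a 2 x 3 image in which the MBR covers the first two
# pixels of the top row for four consecutive frames.
memory = [[0, 0, 0], [0, 0, 0]]
for _ in range(4):
    update_shape_memory(memory, (0, 0, 0, 1))
mask = stable_area(memory)
```

A rectangle that is extracted in the same place frame after frame accumulates counts until it passes the threshold, whereas a rectangle that jumps around never does, which is why the accumulated counts act as a simple temporal filter.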

BRIEF DESCRIPTION OF THE DRAWINGS 

The above objectives and advantages of the present 
invention will become more apparent by describing in detail 
a preferred embodiment thereof with reference to the 
attached drawings in which: 

FIG. 1 is a block diagram illustrating the overall configu- 
ration of the present invention; 

FIG. 2 is a detailed block diagram illustrating an eye 
position detector; 

FIG. 3 is a detailed block diagram illustrating a face
position determiner;

FIG. 4 is a detailed block diagram illustrating an extracted 
position stabilizer; 

FIG. 5 illustrates the brightness distribution of a face 
shape; 

FIGS. 6A-6D illustrate a process of detecting candidate 
eye areas; 

FIGS. 7A-7C illustrate a process of detecting a face 
position; and 

FIG. 8 illustrates the detection of a face position in a serial 
Moving Picture Experts Group (MPEG) image. 

DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

In the present invention, an eye position, which is a 
representative facial feature, is extracted through the analy- 
sis of a common feature shown in a face which is obtained 
from various color images. Eyes in a face have a geometri- 
cally concave shape, so the brightness of the eyes represents 
a strong gray characteristic in an image. A representative 
color characteristic of the eyes is that the values of three
principal components of an input color are similar in
magnitude and are very low in brightness at the eye position.
Furthermore, the brightness difference of colors of the pupil
of eyes and a face is densely and locally distributed, so that
it is characterized by texture in most images in which pupil
contours are represented. In addition, since eye position is
surrounded by a face color, it shows the characteristic of
locality in which both color and texture characteristics
locally occur. In the case of hair, it has a strong gray
characteristic locally in the boundary portion, but such
characteristic is shown wide and long. Thus, hair does not
have the locality characteristic.

The present invention uses the above three main features
as information for the initial detection of an eye position.
After the eye position is detected through the combination of
the three characteristics generated by the eyes, then the exact
eye position is extracted through a combination of several
subsequent processes and a face recognition process, and a
face position is extracted using the information resulting
therefrom. Furthermore, face position information extracted
in this way may be used in an application of region of
interest (ROI) in image transmission by a video phone.

Referring to FIG. 1, an apparatus for enhancing an image
quality by eye and face detection according to an
embodiment of the present invention includes an eye position
detector 10 for determining an eye position in an input
image, a face position determiner 20 for forming a face
template using the detected eye position candidate points
and matching the face template with image data in order to
determine the eye and face positions, and an extraction
position stabilizer 30 for preventing the extracted eye and
face positions from being significantly changed in an image.

As shown in FIG. 2, the eye position detector 10
according to an embodiment of the present invention includes a
color conversion unit 21, a strong gray extraction unit 22, a
median filtering unit 23, an area formation unit 24, an area
shape interpreting unit 25, a texture extraction unit 26, and
an eye candidate determining unit 27. The color conversion
unit 21 converts a video signal YUV of an input image to a
three-color (RGB) signal. The strong gray extraction unit 22
interprets the RGB signal of the image to extract pixels
having a strong gray characteristic. The strong gray
extraction unit 22 uses features in which, if the difference
between the maximum color value (MAXC) and the minimum color
value (MINC) of each color component, which represents a
color for a pixel, is less than a predetermined value t1, and
the value MAXC is less than another predetermined value
t2, then the pixel represents a strong gray characteristic.
Herein, when the values of each color component are
represented in the range of 0-255, preferably, t1 is
determined in the range of 55-65 and t2 is determined in the
range of 90-110. However, the scope of the present
invention is not restricted to the above embodiment, and
includes every known method of extracting strong gray pixels.
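
As a rough illustration of the strong gray test described above, the following sketch applies the two conditions (MAXC - MINC less than t1, and MAXC less than t2) to each pixel. The function names and the representation of an image as rows of (R, G, B) tuples are assumptions made for the example; the default thresholds 60 and 100 are simply picked from the middle of the preferred ranges of 55-65 and 90-110.

```python
def is_strong_gray(r, g, b, t1=60, t2=100):
    """Return True if an RGB pixel (components in 0-255) is strong gray.

    A pixel qualifies when its color components are close to each
    other (MAXC - MINC < t1) and it is dark (MAXC < t2).
    """
    maxc = max(r, g, b)
    minc = min(r, g, b)
    return (maxc - minc) < t1 and maxc < t2


def strong_gray_mask(image, t1=60, t2=100):
    """Build a boolean mask from an image given as rows of (R, G, B)."""
    return [[is_strong_gray(r, g, b, t1, t2) for (r, g, b) in row]
            for row in image]


# Example: a dark, nearly colorless pixel passes; a bright gray
# pixel and a dark but saturated green pixel do not.
image = [[(30, 35, 40), (200, 200, 200), (10, 90, 20)]]
mask = strong_gray_mask(image)
```

The second condition is what separates dark gray regions such as the eyes from bright gray regions such as a white wall, which satisfy the first condition equally well.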

The median filtering unit 23 filters the extracted pixels
with a median filter to blur out spot noise. The area
formation unit 24 groups connected pixels together to form areas
to each of which a corresponding label is provided. The area
shape interpreting unit 25 includes a circularity interpreting
unit 25a, a height-width ratio interpreting unit 25b, and an
area size interpreting unit 25c. The circularity interpreting
unit 25a interprets the shape of each labeled area to
determine whether or not the shape approximates a circle. The
height-width ratio interpreting unit 25b calculates the
height-width ratio of each labeled area, and the area size
interpreting unit 25c computes the relative size of each
labeled area to examine the locality of each area.
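
The grouping of connected pixels into labeled areas, and two of the per-area measures, might look roughly like the sketch below. It is only an illustration under stated assumptions: the use of 4-connectivity, the function names, and the definition of relative size as pixel count divided by image size are choices made for the example rather than details taken from the description, and the circularity measure is left out here.

```python
from collections import deque


def label_areas(mask):
    """Group 4-connected True pixels; return a list of areas.

    Each area is a list of (row, col) coordinates. The choice of
    4-connectivity is an assumption of this sketch.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    areas = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, area = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                area.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas


def shape_measures(area, image_height, image_width):
    """Height-width ratio of the bounding rectangle and relative size."""
    rows = [p[0] for p in area]
    cols = [p[1] for p in area]
    box_height = max(rows) - min(rows) + 1
    box_width = max(cols) - min(cols) + 1
    return {
        "height_width_ratio": box_height / box_width,
        "relative_size": len(area) / (image_height * image_width),
    }


# Example: a 2 x 2 blob and an isolated pixel in a 3 x 4 mask.
mask = [[True, True, False, False],
        [True, True, False, False],
        [False, False, False, True]]
areas = label_areas(mask)
measures = [shape_measures(a, 3, 4) for a in areas]
```

Areas whose measures fall outside chosen limits would then be discarded as lacking locality; the limits themselves are embodiment-specific and are not fixed by this sketch.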
The texture extraction unit 26 includes a morphology
interpreting unit 26a and a horizontal edge interpreting unit
26b. The morphology interpreting unit 26a uses a morphology
filter in each area to examine a texture characteristic by
calculating a texture response. The horizontal edge
interpreting unit 26b extracts a horizontal edge using a
horizontal edge filter. As a morphology filter, a minimum
morphology filter is preferably used (see M. Kunt, IEEE TCSVT
1998), and a Sobel operator, which is a general differential
filter, is used as a horizontal edge filter. Finally, the eye
candidate determining unit 27 determines areas, among the
labeled areas, in which locality and texture characteristics,
respectively, are larger than predetermined values as eye
candidate areas.

Referring to FIG. 3, the face position determiner 20
according to an embodiment of the present invention
includes a face template creation unit 31, a probability
distance operation unit 32, an edge feature interpreting unit
33 and an optimum search template determining unit 34. The
face template creation unit 31 matches a model template
previously provided to the positions of two areas extracted
from the eye candidate areas to create a search template on
an input RGB image by similarity transformation.
Preferably, the model template is formed of a rectangle of a
face area, including two circles indicative of the left and
right eyes, the base of which is located between nose and
mouth portions.

The probability distance operation unit 32 calculates the
sum of probability distances of the skin colors at each pixel
of a face area using the values of the color difference signals
Cr and Cb of pixels within the search template area and
previously trained statistical values, and then normalizes the
sum of the probability distances over the size of search
template. The edge feature interpreting unit 33 detects the
horizontal edge feature of an input RGB image from
estimated locations of eyes, nose and mouth in a search
template. More specifically, the edge feature interpreting unit
33 detects a first horizontal edge size of the input RGB image,
corresponding to the estimated locations of a mouth and a
nose in the search template, and furthermore, detects a
second horizontal edge size in the input RGB image
corresponding to an area matched with the search template
excluding the eyes, nose and mouth locations. Then, the
edge feature interpreting unit 33 calculates an edge
component ratio that is a ratio of the first horizontal edge
size to the second horizontal edge size. In addition, the edge
feature interpreting unit 33 can detect the horizontal edge
size of eyes, normalized over the size of circles indicative of
eye position.

The optimum search template determining unit 34 sets
each predetermined weight to the normalized probability
distance, edge component ratio and the normalized
horizontal edge size of the eyes, to determine a template
having the smallest sum thereof as an optimum search
template. In the case in which an area in which a plurality of
search templates are superimposed is located independently
of an area in which other search templates are superimposed,
the optimum search template determining unit 34 determines
each optimum search template on the basis of an
independent area. This is because a plurality of faces may be
included within an image.

As shown in FIG. 4, the extraction position stabilizer 30
according to an embodiment of the present invention
includes a shape memory 43, a minimum boundary
rectangle (MBR) formation unit 41, a shape memory renewal
unit 42, and a tracking position extraction unit 44, and
furthermore, another embodiment may include a speed &
shape interpreting unit 45. The shape memory 43 stores
count values of the number of pixels corresponding to the
size of an input RGB image (the height of the image times
the width thereof). The MBR formation unit 41 forms a
minimum boundary rectangle in which a facial image within
an optimum search template is included. In the case of the
search template, a rectangle is susceptible to rotation with
respect to an image, depending on the relative positions of
the left and right eyes. However, the MBR includes a facial
[...] determined by the optimum search template, and
has the same orientation as an image regardless of
rotation of a face.

The shape memory renewal unit 42 increases count values
corresponding to the area of an MBR, among pixel-based
count values stored in the shape memory 43, and reduces
count values corresponding to areas other than the MBR.
The tracking position extraction unit 44 outputs areas, in
which count values greater than or equal to a predetermined
value are located, as a speaking person's eye and face areas.
Furthermore, the speed & shape interpreting unit 45
calculates the area and moving speed of the MBR to control
the range of values increased or reduced by the shape memory
renewal unit 42.

The operation details of the present invention will be now
described. At the outset, a process of determining eye
candidate areas according to the present invention will be
described with reference to FIGS. 2, 5, and 6A-6D. The
present invention utilizes input video signals of a general
[...]. The video signal YUV of an input image is converted to
a three-color (RGB) signal. The three important
characteristics considered for eye position detection in the
present invention are: the strong gray characteristic of eyes,
the horizontal edge or texture characteristic [...], and the
locality [...]. FIG. 5 [...] the three characteristics. FIG. 5
shows thirty-two frontal face images for sixteen people,
two images per person, and an image averaging the
frontal face images. As shown in each image of FIG. 5, the
concavity of left and right eyes occurs in regions having a
[...] shape. The important point in the stage of determining
eye candidate areas is to extract eye candidate points
through a combination of the three characteristics.

FIG. 6A illustrates four representative images used in
Moving Picture Experts Group (MPEG) video clips. The
images are mainly head & shoulder views in which a head
and the upper part of body are shown. As shown in FIG. 6A,
the eye portions of the images commonly represent a strong
gray characteristic in which the portions are close to black.
The gray characteristic is caused by the fact that eyes have
a geometrically concave shape.

Therefore, the strong gray extraction unit 22 extracts
pixels representing strong gray from the color signal of an
image, using a characteristic in which a pixel represents
strong gray if the difference between the maximum and
minimum values of a color component representing color for
the pixel is small and brightness is distributed low. FIG. 6B
shows the extraction of pixels representing the strong gray
characteristic. Referring to FIG. 6B, strong gray pixels in
each image are indicated as white pixels by superimposing
them on the original image, and dark portions of the
background as well as eyes, in each image, are extracted.

When it comes to spatial distribution for the extracted
pixels in image coordinates, the gray pixels of eye portions
are localized on the inside of the skin area of a face, while
gray pixels of background or head portions occur in large
lumps or scattered widely. This means that a locality
characteristic becomes the consistent feature of the eye portion,
and accordingly only eye candidate areas can be extracted
using the locality characteristic.

After performing median filtering and area labelling on
the edge pixels output from the strong gray extraction unit
22, the area shape interpreting unit 25 calculates the size,
circularity and height-width ratio of each area to remove
areas having no locality characteristic. In the circularity
measurement, it is necessary to search for areas whose
shapes are close to a circle, irrespective of the orientation or
size of the areas. Thus, an embodiment of the present
invention preferably employs the following equations by
Haralick [Computer & Robot Vision, Addison-Wesley Pub.,
1992] as a standard of measuring circularity and
height-width ratio having such a characteristic:

$$\mu_R = \frac{1}{n} \sum_{k=0}^{n} \left\| (r_k, c_k) - (\bar{r}, \bar{c}) \right\| \tag{1}$$

$$\sigma_R^2 = \frac{1}{n} \sum_{k=0}^{n} \left[ \left\| (r_k, c_k) - (\bar{r}, \bar{c}) \right\| - \mu_R \right]^2 \tag{2}$$

In the equations (1) and (2), the two values $\mu_R$ and
$\sigma_R$ are defined in terms of pixels $r_k$ and $c_k$ where k
denotes the index for pixels within a shape and goes from 0 to n,
and $(\bar{r}, \bar{c})$ is the coordinate of an area center. The
value of $\mu_R/\sigma_R$ measured from the two computed values
indicates the circularity of the shape. If the value of
$\mu_R/\sigma_R$ in an area is less than a predetermined value
(the predetermined value is 1.2 in the preferred embodiment of
the present invention, but the scope of the invention is not
restricted to it), there is a high likelihood of representing
an area of random shape, so the corresponding area is excluded.

The MBR of an area is calculated to compute a
height-width ratio. The height-width ratio is limited so that
areas that are long in the vertical direction of an image are
removed. In an embodiment of the present invention, areas
the height-width ratio of which is less than 0.7 or greater
than 3.5 are removed. Furthermore, areas in which the
number of pixels is greater than or equal to a predetermined
value are excluded. In an embodiment of the present
invention, if the size of an area, which is equivalent to (the
number of pixels in the height of the image multiplied by the
number of pixels in the width thereof/1,600), is greater than
a predetermined value, the area is excluded. However, the
scope of the present invention is not restricted to the numeral
limit used in an embodiment thereof.

FIG. 6C shows a texture characteristic detected by a
morphology operator. Referring to FIG. 6C, a strong
response (magnitude of brightness) of a texture
characteristic is extracted due to the densely localized
brightness difference. The texture characteristic is represented
strongly in an edge portion, not the boundary between areas.
Furthermore, it can be found that a horizontal edge
characteristic consistently exists, considering that the
brightness difference strongly occurs in the vicinity of eyes,
in a vertical direction. Thus, eye candidate areas can be finally
determined by selecting only portions including strong
horizontal edge and texture characteristics among eye area
candidates extracted by area shape interpretation.

FIG. 6D illustrates the final eye candidate points thus
extracted. Since a face has left and right eyes, if the positions
of both eyes are determined, then the size, direction and
location of a face template to be compared can be
determined. In other words, the position of the eyes is
determined by superimposing a face template according to the
extracted eye candidate positions, and finally identifying a
face area.

Next, a process of determining a face area will be now
described with reference to FIGS. 3, 7A-7C, and 8. FIGS.
7A-7C explain a process of determining an optimum
search template using a model template. FIG. 7B shows the
shapes of a search face template matched by superimposing
the search template on the selected eye candidate area. More
particularly, a model template consists of a rectangular shape
the size of which is changeable, and two circular shapes
indicative of an eye position within the rectangular shape. In
addition, the base of the rectangle of the model template is
located between nose and mouth. Once the position of a pair
of eye candidates is selected, the size, orientation, shape and
location of the model template in the image are determined,
so that the model template is superimposed on the eye
candidate area. Subsequently, it is determined whether or not
the selected eye candidate area actually represents eyes on a
face by investigating the colors and geometrical
characteristics of image areas contained within the overlapped
model template. The model template similarity transforms into
a search template with four factors. In this case, it is possible
to determine a conversion factor because there are four
equations and four unknowns. FIG. 7C indicates the finally
recognized eye position and detected face area.

The following is a process of recognizing a face in the
search template determined by the eye position. Firstly, a
face takes on a skin color and the distribution of the skin
color of humans has a given range. Many studies
demonstrate that the reflection color of an object varies widely
and largely depending on a change in illumination and shape,
but the color of a face or skin has a specified value and a
specified distribution in most images. In the light of this fact,
it is possible to recognize face candidate areas by using
distribution of a skin color. It can be assumed that a face
color has a Gaussian distribution in a two-dimensional
chrominance space. Thus, a skin color can be selected from
thousands of MPEG video images to calculate a statistical
value. Using the computed statistical value, it is possible to
compute a probability distance indicating whether or not the
internal areas of search templates superimposed as shown in
FIG. 7B are close to a skin color. In an embodiment of the
present invention, a Mahalanobis distance is used as a
probability distance.

$$d^2(x) = (x - \mu)^{T} \Sigma^{-1} (x - \mu) \tag{3}$$

Here, $d$ and $x$ denote the probability distance and
a vector value comprised of the color difference
signals Cr and Cb, respectively. Furthermore, $\mu$ and $\Sigma$
indicate the average vector of trained skin color, and the
variance matrix of trained value. As the sum of a
Mahalanobis distance for the chromaticity of the inside,
normalized to the size of a template, becomes less, there is a
greater likelihood of representing a face area.

Secondly, a mouth or a nose are positioned in the vicinity
of the central portion of the base of the rectangular search
template, and the horizontal edge component of this portion
is large. In contrast, the remaining face area portions of the
search template except the mouth, nose and eye portions,
have a comparatively even brightness distribution, and there
is no particular edge component. Thus, the ratio of
horizontal edge components calculated in both areas is used
as a discrimination value.

Thirdly, the horizontal edge of an eye portion is relatively
large. Thus, the horizontal edge size of the eyes normalized
by the size of circles indicative of the eye portion can be
used for identifying a face. With respect to several search
templates superimposed in FIG. 7B, the values of the above
three factors, i.e. a Mahalanobis distance, the ratio of a
horizontal edge component and the horizontal edge size of
the eyes, are calculated, and then a search template, having
the smallest sum of the values, weighted corresponding to
the importance of each factor, is selected as an optimum
search template. If each search template is superimposed on
a plurality of eye candidate areas corresponding thereto,
only the search template that gives a minimum response is
extracted. Furthermore, if an area formed by superimposing
a plurality of search templates is located independently of an
area formed by superimposing other search templates, it is
determined that two or more people exist, and an optimum
search template is determined on an area-by-area basis. The
above processes facilitate detection of the eye position and
face. In connection therewith, FIG. 8 exemplifies the
positions of eyes and faces detected in typical serial MPEG
images in which head & shoulders are shown.

Finally, an extraction stabilization process will now be
described with reference to FIGS. 4 and 7C. The template of
the eye and face extracted through face recognition as shown
in FIG. 7C requires stabilizing in a serial image. Natural
[...] continuously increased until the value reaches a
predetermined value, and when the value comes to the
predetermined level, the value is maintained. Contrarily, in
positions outside the MBRs, a count value of the shape memory
is repeatedly reduced until the value comes to zero where it is
maintained. In an embodiment of the present invention, the
count value is in the range of 0-7. If the same process is
repeatedly performed, only an object sequentially extracted in
similar [...] extraction, [...] repeatedly extracted in random
positions naturally has a low count value in the shape memory.
Thus, an object is determined to exist in only a portion in
which the count value of the shape memory is above a
predetermined threshold value. According to an
embodiment of the present invention, only a portion
indicating a count value above 3 is determined to be the
position in which a face exists. However, the scope of the
present invention is not restricted to the range of the count
value and threshold value chosen in the preferred embodiment
for identifying the positions of eye and face.

The advantage of the shape cumulative memory is that
object detection and position stabilization can be simply
accomplished and the operating speed is very fast,
considering its efficiency. Furthermore, a count step can be con-
image sequences always have image noise generated by trolled in such a way as to reduce or increase the count value 
several causes such as an environment condition for obtain- 25 depending on the size of MBR, the position of which is 
ing sequences, and factors within an image input apparatus. significantly changed or extracted, thereby adapting the 
Therefore, the image quality of two sequential images on a memory to the speed of a moving object, importance of an 
serial image input in quite a short time shows a different object, or a shape characteristic. For example, if an object 
characteristic in many aspects. The image noise character- moves more slowly or if the face size of an object is 
istic affects the computation of image feature value, so that 30 comparatively small, a count step is preferably made large, 
the feature value calculated in the image changes many In a moving image communication by a moving image 
times along a time axis. The efBciency for image recognition phone or mobile phone, a human face becomes the most 
and object detection is influenced by such instability factors, important region of interest. Thus, when creating an image 
and also the position of the template of the eye and face compressed by encoders such as MPEG-1, MPEG-2, 
shown in FIG. 7C tends to be extracted unstably in a serial 35 MPEG4, and H.263, the image quality in a face area can be 
image. In order to remove the instability factors, the present improved by using information of the extracted face area, 
invention uses a technique of accumulating the MBR posi- This means that the present invention can appropriately be 
tion information indicative of the boundary of an object to applied to an apparatus for controUing the entire amount of 
solve instability in a template extraction. transmitted information and maintaining a high resolution 

In general, an object having mass has a moment of inertia. 40 image by transmitting only a face portion which is an ROI 
When a human or an object moves in an image, a significant with a high resolution image, and transmitting the remaining 
moving change rarely occurs in the minute intervals of a background portion or a portion other than the ROI with a 
time axis. In particular, the spatial position of a human in a low resolution image or low amount of information, 
head & shoulder image is likely to be subsequenUy repre- The eye and face detection according to the preferred 
sented at a predetermined location, and seen fiom a serial 45 embodiment of the present invention may be embodied as a 
image, the position increasingly changes at slow speed. In computer program which can be executed on a computer, 
other words, there exists a temporal coherence for extracted and the program can be read out fi"om recording media in 
positions between sequential image frames. Seen from video which the program is recorded to execute it in a general- 
image obtained in units of 20-30 pieces per second, there are purpose digital computer system. The recording media 
few occasions when an object on the left in the i-th frame 50 includes magnetic storage media (e.g., ROM, floppy disk, 
appears suddenly on the right in the i+l-th frame. Using the hard disk, etc.), optical read memory (e.g., CD-ROM, 
temporal coherence of a time axis facilitates the extraction DVD), and carrier wave (e.g., transmission through 
position stabilization and the sequential extraction and trade- Internet). 

ing of MBR. The process can be simply implemented by The eye and face detection apparatus according to the 
using a shape memory technique which will be described in 55 present invention is capable of accurately and quickly 
the following- detecting the eyes and face in an image and is tolerant of 
Firstly, a shape memory having a space for storing the image noise. In other words, the present invention can be 
count value equivalent to the size of a video frame \s secured simultaneously applied to an image having a static back- 
to initialize count values corresponding to individual pixels. ground and an image having a dynamic background. In the 
Next, inputs from n MBRs of the extracted face are received 60 course of detecting eye and face positions, high-speed 
to increase the count vahies of a shape memory assigned to processing and paraUel processing are enabled by avoiding 
corresponding locations of pixels within the MBRs. Also in a search of the entire image. Furthermore, reliability 
the subsequent image, the same number of MBRs are input enhanced eye and face detection is allowed in combination 
to repeat the same process as for the preceding image. If the with movement detection, etc. The present invention can 
extracted MBRs are serially extracted in similar positions, 6S appropriately be used in appHcatioos such as video phones, 
count values in the corresponding positions of a shape monitoring systems requiring preservation of a face with a 
memory continues to be increased. A count value is con- high resolution image, and content-based image searching. 
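
The shape memory update described in this section can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names, the rectangle convention (top, left, bottom, right, with bottom and right exclusive) and the default step of 1 are assumptions made for the example, while the count range of 0-7 and the threshold of 3 are the values given for the preferred embodiment.

```python
def update_shape_memory(memory, mbrs, step=1, max_count=7):
    """Update the per-pixel count values in place for one frame.

    memory: list of rows (lists of ints), one count per pixel.
    mbrs:   list of (top, left, bottom, right) rectangles; top/left are
            inclusive, bottom/right exclusive (assumed convention).
    """
    height = len(memory)
    width = len(memory[0]) if height else 0
    # Mark every pixel covered by at least one MBR in this frame.
    inside = [[False] * width for _ in range(height)]
    for top, left, bottom, right in mbrs:
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                inside[y][x] = True
    for y in range(height):
        for x in range(width):
            if inside[y][x]:
                # Increase, then hold once the ceiling is reached.
                memory[y][x] = min(max_count, memory[y][x] + step)
            else:
                # Decrease, then hold at zero.
                memory[y][x] = max(0, memory[y][x] - step)
    return memory


def face_mask(memory, threshold=3):
    """Report pixels whose count value is above the threshold."""
    return [[count > threshold for count in row] for row in memory]
```

Calling `update_shape_memory` once per frame with the MBRs extracted from that frame, and then `face_mask`, gives the stabilized area: a rectangle that recurs in similar positions accumulates counts and crosses the threshold, while a rectangle that appears once in a random position decays back to zero. A larger `step` makes the memory react faster, which corresponds to the controllable count step mentioned above.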
While this invention has been particularly shown and 
described with reference to a preferred embodiment thereof, 
it will be understood by those skilled in the art that various 
changes in form and details may be made therein without 
departing from the spirit and scope of the invention as 
defined by the appended claims. Therefore, the described 
embodiment should be considered not in terms of restriction 
but in terms of explanation. The scope of the present 
invention is limited not by the foregoing but by the follow- 
ing claims, and all differences within the range of equiva- 
lents thereof should be interpreted to be covered by the 
present invention. 

What is claimed is: 

1. An apparatus for detecting a speaking person's eye and 
face, the apparatus comprising: 

an eye position detecting means for detecting pixels 
having a strong gray characteristic to determine areas 
having locality and texture characteristics as eye can- 
didate areas among areas formed by the detected pixels, 
in an input red, blue, and green (RGB) image; 

a face position determining means for creating search 
templates by matching a model template to two areas 
extracted from the eye candidate areas, and determin- 
ing an optimum search template among the created 
search templates by using the value normalizing the 
sum of a probability distance for the chromaticity of 
pixels within the area of a search template, and hori- 
zontal edge sizes calculated in the positions of the left 
and right eyes, a mouth and a nose estimated by the 
search template; and 

an extraction position stabilizing means for forming a 
minimum boundary rectangle by the optimum search 
template, and increasing count values corresponding to 
the minimum boundary rectangle area and reducing 
count values corresponding to an area other than the 
minimum boundary rectangle area, among count values 
of individual pixels, stored in a shape memory, to 
output the area in which count values above a prede- 
termined value are positioned, as eye and face areas. 

2. The apparatus of claim 1, wherein the eye position 
detecting means comprises: 

a strong gray extraction unit for interpreting an input RGB 
image signal to extract pixels that represent a strong 
gray characteristic; 

an area formation unit for forming areas by combining 
adjacent pixels with each other among the extracted 
pixels; 

an area shape interpreting unit for detecting a locality 
characteristic for each formed area; 

a texture extraction unit for detecting a texture character- 
istic for each formed area; and 

an eye candidate determining unit for determining areas in 
which the locality and texture characteristics, 
respectively, are greater than predetermined values as 
eye candidate areas, among the formed areas. 

3. The apparatus of claim 1, wherein the face position 
determining means comprises: 

a face template creation unit for creating search templates 
by matching a previously provided model template to 
the positions of the two areas extracted from the eye 
candidate areas to perform similarity transformation on 
the matched model template to create a search template 
in an input RGB image; 

a probability distance operation unit for calculating a 
normalized probability distance for normalizing the 
sum of the probability distances for chromaticity of 
pixels within a search template area in an RGB image, 
with respect to the size of the search template; 

an edge feature interpreting unit for detecting horizontal 
edge feature values of an RGB image input from the 
positions of eyes, a nose, and a mouth estimated in the 
search template; and 

an optimum search template determining unit for deter- 
mining an optimum search template among a plurality 
of search templates created by the face template cre- 
ation unit, according to the values obtained by setting 
predetermined weights on the normalized probability 
distance and the horizontal edge feature values. 

4. The apparatus of claim 1, wherein the extraction 
position stabilizing means comprises: 

a shape memory for storing the count values of the 
number of pixels corresponding to the size of the input 
RGB image; 

a minimum boundary rectangle formation unit for form- 
ing a minimum boundary rectangle in which a face 
image is included within the optimum search template; 

a shape memory renewal unit for increasing the count 
values corresponding to an area of the minimum 
boundary rectangle area and reducing the count values 
corresponding to an area outside the minimum bound- 
ary rectangle area, among count values of individual 
pixels stored in the shape memory; and 

a tracking position extraction unit for outputting an area 
in which count values above a predetermined value are 
positioned in the shape memory as a speaking person's 
eye and face areas. 

5. The apparatus of claim 2, wherein the strong gray 
extraction unit extracts pixels of the RGB image, in each of 
which the difference between a maximum value and a 
minimum value of a color component representing a color is 
less than a predetermined value and the maximum value is 
less than another predetermined value, as pixels having a 
strong gray characteristic. 

6. The apparatus of claim 2, wherein the area shape 
interpreting unit comprises a circularity interpreting unit for 
computing a circularity value of each area, and 

wherein the eye candidate determining unit removes an 
area, the circularity value of which is less than a 
predetermined value, from the eye candidate areas. 

7. The apparatus of claim 2, wherein the area shape 
interpreting unit comprises a height-width ratio interpreting 
unit for computing the height-width ratio of each area; and 

wherein the eye candidate determining unit removes an 
area, the height-width ratio of which is less than a 
predetermined value or is greater than another prede- 
termined value, from the eye candidate areas. 

8. The apparatus of claim 2, wherein the area shape 
interpreting unit comprises an area size interpreting unit for 
computing the size of each area relative to the size of the 
overall image, and 

wherein the eye candidate determining unit removes an 
area, the relative size of which is greater than a prede- 
termined value, from the eye candidate areas. 

9. The apparatus of claim 2, wherein the texture extraction 
unit comprises a morphology interpreting unit with a mini- 
mum morphology filter for computing the texture response 
of each area; and 

wherein the eye candidate determining unit removes an 
area, the texture characteristic value of which is less 
than a predetermined value, from the eye candidate 
areas. 
10. The apparatus of claim 2, wherein the texture extrac- 
tion unit comprises a horizontal edge interpreting unit with 
a differential filter for detecting the horizontal edge of each 
area; 

wherein the eye candidate determining unit removes an 
area, the horizontal edge characteristic value of which 
is less than a predetermined value, from the eye can- 
didate areas. 

11. The apparatus of claim 3, wherein the model template 
is formed of a rectangle including two circles indicative of 
the left and right eyes, in which the base of the rectangle is 
located between nose and mouth portions. 

12. The apparatus of claim 3, wherein the probability 
distance d is calculated by the following equation: 

$$d^2(x) = (x - \mu)^{T} \, \Sigma^{-1} \, (x - \mu)$$

where x is the vector value of input color difference signals 
$C_r$ and $C_b$, $\mu$ is the average vector of previously trained skin 
color, and $\Sigma$ is the variance matrix of the trained values. 

13. The apparatus of claim 3, wherein the edge feature 
interpreting unit detects a first horizontal edge size of the 
input RGB image corresponding to the mouth and nose 
positions estimated in the search template, and a second 
horizontal edge size of the input RGB image corresponding 
to an area matched to the search template, except the 
positions of eyes, nose and mouth, and calculates the edge 
component ratio that normalizes the ratio of the first hori- 
zontal edge size to the second horizontal edge size. 

14. The apparatus of claim 13, wherein the edge feature 
interpreting unit detects the horizontal edge size of areas of 
the RGB image corresponding to eyes normalized over the 
size of the circles indicative of the eye position, and 

wherein the optimum search template determining unit 
determines a template, having the smallest sum of the 
normalized probability distance, the edge component 
ratio, and the normalized horizontal edge size of areas 
of the RGB image corresponding to the eyes which are 
each set with predetermined weights, as an optimum 
search template. 

15. The apparatus of claim 3, wherein, if an area that is 
formed by superimposing a plurality of search templates is 
located independently of an area formed by superimposing 
other search templates, the optimum search template deter- 
mining unit determines optimum search templates of inde- 
pendent areas. 

16. The apparatus of claim 4, further comprising a speed 
& shape interpreting unit for computing the size and moving 
speed of the minimum boundary rectangle to control the 
range of values increased or reduced by the shape memory 
renewal unit. 

17. A method of detecting a speaking person's eye and 
face areas, the method comprising the steps of: 

(a) detecting pixels having a strong gray characteristic to 
determine areas having locality and texture character- 
istics as eye candidate areas among areas formed by the 
detected pixels, in an input red, blue, and green (RGB) 
image; 

(b) creating search templates by matching a model tem- 
plate to two areas extracted from the eye candidate 
areas, and determining an optimum search template 
among the created search templates by using the value 
normalizing the sum of a probability distance for the 
chromaticity of pixels within the area of a search 
template, and horizontal edge sizes in the positions of 
the left and right eyes, a mouth and a nose, estimated 
by the search template, in the RGB image; and 
(c) forming a minimum boundary rectangle by the opti- 
mum search template, and increasing count values 
corresponding to the minimum boundary rectangle area 
and reducing count values corresponding to an area 
other than the minimum boundary rectangle area, 
among count values of individual pixels, stored in a 
shape memory, to output the area, in which count 
values above a predetermined value are positioned, as 
eye and face areas. 

18. The method of claim 17, wherein the step (a) com- 
prises the steps of: 

(a1) interpreting an input RGB image signal to extract 
pixels that represent a strong gray characteristic; 

(a2) forming areas by combining adjacent pixels with 
each other among the extracted pixels; 

(a3) detecting a locality characteristic in each formed 
area; 

(a4) detecting a texture characteristic in each formed area; 
and 

(a5) determining areas, in which the locality and texture 
characteristics, respectively, are greater than predeter- 
mined values, among the formed areas, as eye candi- 
date areas. 

19. The method of claim 17, wherein the step (b) com- 
prises the steps of: 

(b1) creating search templates in the RGB image by 
matching a previously provided model template to the 
positions of the two areas extracted from the eye 
candidate areas, to perform similarity transformation 
on the matched model template; 

(b2) calculating a normalized probability distance for 
normalizing the sum of the probability distance for 
chromaticity of pixels within a search template area by 
the size of the search template, in the RGB image; 

(b3) detecting horizontal edge feature values of the RGB 
image input from the positions of eyes, a nose, and a 
mouth estimated in the search template; and 

(b4) determining an optimum search template among a 
plurality of search templates created by the face tem- 
plate creation unit, by using the values obtained by 
setting predetermined weights on the normalized prob- 
ability distance and the horizontal edge feature value. 

20. The apparatus of claim 17, wherein the step (c) 
comprises the steps of: 

(c1) forming the minimum boundary rectangle in which a 
face image is included within the optimum search 
template; 

(c2) increasing the count values corresponding to an area 
of the minimum boundary rectangle and reducing the 
count values corresponding to an area outside the 
minimum boundary rectangle area, among count values 
of individual pixels stored in the shape memory; and 

(c3) outputting an area in which count values above a 
predetermined value are positioned in the shape 
memory as a speaking person's eye and face areas. 

21. The method of claim 18, wherein, in the step (a1), 
pixels of the RGB image, for each of which the difference 
between a maximum value and a minimum value of a color 
component representing a color is less than a predetermined 
value, and the maximum value is less than another prede- 
termined value, are extracted as pixels having a strong gray 
characteristic. 

22. The method of claim 18, wherein, in the step (a3), the 
circularity value of each area is calculated, and 

wherein, in the step (a5), an area, the circularity value of 
which is less than a predetermined value, is removed 
from the eye candidate areas. 
23. The method of claim 18, wherein, in the step (a3), the 
height-width ratio of each area is calculated; and 

wherein an area, the height-width ratio of which is less 
than a predetermined value or is greater than another 
predetermined value, is removed from the eye candi- 
date areas. 

24. The method of claim 18, wherein, in the step (a3), the 
size of each area relative to the size of the overall image is 
calculated, and 

wherein, in the step (a5), an area, the relative size of 
which is greater than a predetermined value, is 
removed from the eye candidate areas. 

25. The method of claim 18, wherein, in the step (a4), the 
texture response of each area is calculated; and 

wherein, in the step (a5), an area, the texture characteristic 
value of which is less than a predetermined value, is 
removed from the eye candidate areas. 

26. The method of claim 18, wherein, in the step (a4), the 
horizontal edge of each area is detected; and 

wherein, in the step (a5), an area, the horizontal edge 
characteristic value of which is less than a predeter- 
mined value, is removed from the eye candidate areas. 

27. The method of claim 19, wherein the model template 
is formed of a rectangle including two circles indicative of 
the left and right eyes, the base of which is located between 
nose and mouth portions. 

28. The method of claim 19, wherein the probability 
distance d is calculated by the following equation: 

$$d^2(x) = (x - \mu)^{T} \, \Sigma^{-1} \, (x - \mu)$$

where x is the vector value of input color difference signals 
$C_r$ and $C_b$, $\mu$ is the average vector of previously trained skin 
color, and $\Sigma$ is the variance matrix of the trained values. 
29. The method of claim 19, wherein, in the step (b3), a 
first horizontal edge size of the input RGB image corre- 
sponding to the mouth and nose positions estimated in the 
search template, and a second horizontal edge size of the 
input RGB image corresponding to an area matched to the 
search template, except the positions of eyes, nose and 
mouth, are detected, and the edge component ratio that is a 
ratio of the first horizontal edge size to the second horizontal 
edge size is calculated. 

30. The method of claim 29, wherein the step (b3) further 
comprises the step of detecting the horizontal edge size of 
areas of the RGB image corresponding to the eyes normalized by the 
size of the circles indicative of the eye positions, and 

wherein, in the step (b4), a template, having the smallest 
sum of the normalized probability distance, the edge 
component ratio, and the normalized horizontal edge 
size of the areas of the RGB image corresponding to the 
eyes, which are each set with predetermined weights, is 
determined as an optimum search template. 

31. The method of claim 19, wherein, in the step (b4), if 
an area that is formed by superimposing a plurality of search 
templates is located independently of an area formed by 
superimposing other search templates, the optimum search 
template determining unit determines optimum search tem- 
plates of independent areas. 

32. The method of claim 20, after the step (c1), further 
comprising the step of computing the size and moving speed 
of the minimum boundary rectangle to control the range of 
values increased or reduced by the shape memory renewal 
unit. 
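
As an illustration only, the template selection recited in claims 12 to 14 and 28 to 30 can be sketched as follows. The function names, the dictionary keys `distance`, `edge_ratio` and `eye_edge`, and the default weights are hypothetical. The text states only that the template with the smallest weighted sum is chosen; it does not state how the two edge terms are scaled so that a smaller value is better, so the sketch assumes they are supplied already in that form.

```python
def mahalanobis_sq(x, mean, cov):
    """Squared distance (x - mean)^T cov^-1 (x - mean) for 2-D vectors.

    x, mean: pairs of color difference values.
    cov:     2x2 variance matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("variance matrix is singular")
    dx = x[0] - mean[0]
    dy = x[1] - mean[1]
    # The inverse of a 2x2 matrix is (1 / det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def normalized_distance(pixels, mean, cov):
    """Sum of per-pixel distances divided by the number of pixels."""
    total = sum(mahalanobis_sq(p, mean, cov) for p in pixels)
    return total / len(pixels)


def select_template(candidates, weights=(1.0, 1.0, 1.0)):
    """Return the candidate with the smallest weighted sum of factors.

    Each candidate is a dict with the hypothetical keys 'distance',
    'edge_ratio' and 'eye_edge', all assumed to be smaller-is-better.
    """
    w_dist, w_ratio, w_eye = weights

    def score(candidate):
        return (w_dist * candidate["distance"]
                + w_ratio * candidate["edge_ratio"]
                + w_eye * candidate["eye_edge"])

    return min(candidates, key=score)
```

In use, `normalized_distance` would be evaluated over the chromaticity pixels inside each search template against a previously trained skin color mean and variance matrix, and the resulting value stored under `distance` before `select_template` is called on the list of candidates.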

***** 